Fundamentals of digital audio

Audio signals

Sound is perceived as rapid changes in air pressure on the eardrum.

Plotting sound pressure against time gives a waveform.

Waveforms can be periodic or non-periodic (noise).

Periodic waveforms have a characteristic frequency, perceived as pitch and measured in cycles per second, or hertz (Hz), and an amplitude envelope, measured in decibels (dB).

The limits of human hearing are about 20 Hz to about 20,000 Hz, and 0 dB (the threshold of hearing, effectively silence) to about 120 dB.

Apart from the fundamental frequency, a sound can have many other frequencies present. A spectrum of a periodic waveform plots amplitude against frequency. Frequencies that are whole-number multiples of the fundamental are called harmonics. A pure signal is a sine wave, which has no harmonics, just a fundamental.

A periodic signal can be analysed as the sum of a (possibly infinite) number of sine waves of different frequencies, amplitudes and phases.
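
As an illustration of this idea, the sketch below builds a square wave from sine waves, using the standard result that a square wave contains only odd harmonics, each with amplitude 1/k of the fundamental. The function name and the 1 Hz fundamental are choices made for this example only.

```python
import math

def square_wave_partial_sum(t, fundamental_hz, n_harmonics):
    """Add up the first n_harmonics odd harmonics of a square wave at time t (seconds)."""
    total = 0.0
    for k in range(n_harmonics):
        harmonic = 2 * k + 1  # 1, 3, 5, ... (a square wave has no even harmonics)
        total += math.sin(2 * math.pi * harmonic * fundamental_hz * t) / harmonic
    return 4 / math.pi * total

# Value at the middle of the positive half-cycle of a 1 Hz wave.
# A single sine overshoots; adding more harmonics brings the sum closer to 1.
for n in (1, 3, 10, 1000):
    print(n, "harmonics ->", round(square_wave_partial_sum(0.25, 1.0, n), 4))
```

The more sine waves are added, the flatter the top of the waveform becomes, which is why an ideal square wave needs an infinite number of them.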

The phase of a signal is its starting point within the cycle, which fixes its value on the amplitude axis at time zero. Filters use phase shift (delay) to produce frequency-dependent phase cancellation.

Sounds can be captured and manipulated in analog form as voltages (microphone, loudspeaker, amplifier), as grooves on an LP, or as magnetic strength on tape.
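
A minimal sketch of frequency-dependent cancellation, assuming the simplest possible filter: a signal mixed with one delayed copy of itself. It relies on the identity sin(a) + sin(a - p) = 2 sin(a - p/2) cos(p/2), where p is the phase shift caused by the delay. The function name and the 1 ms delay are hypothetical values for this example.

```python
import math

def gain_with_delayed_copy(freq_hz, delay_s):
    """Peak amplitude of a unit sine wave added to a copy of itself delayed by delay_s."""
    phase_shift = 2 * math.pi * freq_hz * delay_s  # lag in radians caused by the delay
    return 2 * abs(math.cos(phase_shift / 2))

# With a 1 ms delay, 500 Hz lags by half a cycle and cancels,
# while 1000 Hz lags by a whole cycle and is reinforced.
for freq in (250, 500, 750, 1000):
    print(freq, "Hz ->", round(gain_with_delayed_copy(freq, 0.001), 3))
```

The same delay produces a different phase shift at each frequency, so some frequencies are attenuated and others boosted.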

Digital Representation

Digital representation translates voltages into a series of equally spaced samples, which can then be stored as numbers in a computer; this is A-to-D conversion. D-to-A conversion takes the stream of numbers and outputs a varying analog signal (as voltages). Analog filters are used during A-to-D sampling to remove frequencies that cannot be represented (because of the limitations of sampling) and during D-to-A output to smooth the analog signal.

Sampling

Digital signals have a sampling rate, measured in Hz, and a resolution, measured by the number of bits used for each sample. A CD uses 16-bit samples (giving 65,536 possible amplitude levels) and a 44.1 kHz sampling rate.
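
The two parameters can be sketched in a few lines. This is only an illustration: the function names and the 441 Hz test tone are made up for the example, and a real converter does this in hardware.

```python
import math

def amplitude_levels(bits):
    """Number of distinct values a sample of the given resolution can take."""
    return 2 ** bits

def sample_sine(freq_hz, sample_rate_hz, n_samples, bits=16):
    """Sample a full-scale sine wave and store each sample as a signed integer."""
    max_code = 2 ** (bits - 1) - 1  # largest positive value, 32767 for 16 bits
    return [
        round(max_code * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz))
        for n in range(n_samples)
    ]

print(amplitude_levels(16))  # 65536
# A 441 Hz tone at the CD rate has exactly 100 samples per cycle.
samples = sample_sine(441.0, 44100, 100)
print(samples[:5])
```

The sampling rate fixes how many numbers are stored per second, and the resolution fixes how finely each number can represent the amplitude.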

Aliasing

Aliasing can occur when either a signal is sampled at too low a rate, or the signal being sampled has frequencies present which are too high. When such a signal is sampled and converted back to analog, the rogue frequencies are folded over as lower frequencies, causing distortion. If we use a sine wave of varying pitch as the analog signal, and use a constant sampling rate, then D-to-A conversion of the sampled signal produces a sine wave of the same frequency as the input only if it is less than half of the sampling rate. Frequencies between half the sampling rate and the sampling rate are folded over according to the formula fout = sr - fin, where sr is the sampling rate, and fin and fout are the input and output frequencies. Another way to say this is that if all the frequencies in the input signal are to be reproduced, then the signal must be sampled at a rate more than twice that of the highest frequency present. The frequency which is half of the sampling rate is the Nyquist frequency.
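
The fold-over rule can be checked numerically. The sketch below applies fout = sr - fin to inputs above the Nyquist frequency (first reducing the input modulo sr, an extension added here so that inputs above the sampling rate are also handled) and confirms that the samples of the rogue frequency are identical to those of its alias. Function names and the 30 kHz test tone are hypothetical.

```python
import math

def alias_frequency(f_in, sr):
    """Frequency reproduced after sampling a sine of f_in Hz at sr Hz."""
    f = f_in % sr   # samples cannot distinguish f_in from f_in shifted by multiples of sr
    if f > sr / 2:  # above the Nyquist frequency: fold over
        f = sr - f
    return f

def sampled_cosine(freq_hz, sr, n_samples):
    """Samples of a unit cosine wave taken at sr samples per second."""
    return [math.cos(2 * math.pi * freq_hz * n / sr) for n in range(n_samples)]

sr = 44100
f_out = alias_frequency(30000, sr)
print(f_out)  # 14100

# The 30 kHz input and the 14.1 kHz alias produce the same sample values.
high = sampled_cosine(30000, sr, 50)
low = sampled_cosine(f_out, sr, 50)
print(max(abs(a - b) for a, b in zip(high, low)) < 1e-9)
```

Because the two sets of samples are the same, nothing after the sampling stage can tell the frequencies apart, which is why the filtering has to happen before A-to-D conversion.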

Since the limit of human hearing is about 20-24 kHz, audio signals have to be sampled at a rate of at least 40-48 kHz. 48 kHz is standard in the recording industry, although 96 kHz is becoming more popular, and DVD-Audio supports sampling rates up to 192 kHz. Vinyl LPs can reproduce frequencies up to about 25 kHz only if the equipment is top quality; standard equipment tops out at 20 kHz.

Quantization

Quantization noise occurs because of the limited resolution in amplitudes, and the quality of the signal itself. The differences between the original signal and the quantized form present themselves as random noise. In general, the more bits, the less quantization noise. With very low-level signals the quantized signal may be 'square', giving greater distortion due to the high-frequency components of a square wave. Dithering, which introduces a small amount of analog noise into the original signal, can reduce this distortion.
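
A rough sketch of both effects, under stated assumptions: samples are floating-point values between -1 and 1, quantization is simple rounding to the nearest level, and the dither is triangular noise spanning plus or minus one step (one common choice, not necessarily what any particular converter uses). All names and test values are made up for the example.

```python
import math
import random

def quantize(x, bits):
    """Round x (between -1 and 1) to the nearest level available at this resolution."""
    step = 2.0 / (2 ** bits)
    return step * round(x / step)

def rms_quantization_error(bits, n_samples=10000, freq_hz=440.0, sr=44100.0):
    """RMS difference between a sine wave and its quantized form."""
    total = 0.0
    for n in range(n_samples):
        x = 0.9 * math.sin(2 * math.pi * freq_hz * n / sr)
        total += (quantize(x, bits) - x) ** 2
    return math.sqrt(total / n_samples)

def quantize_with_dither(x, bits, rng):
    """Add triangular noise of up to one step before rounding."""
    step = 2.0 / (2 ** bits)
    noise = (rng.random() - rng.random()) * step
    return step * round((x + noise) / step)

# More bits, less quantization noise.
for bits in (8, 12, 16):
    print(bits, "bits -> RMS error", rms_quantization_error(bits))

# A signal smaller than half a step vanishes without dither,
# but still influences the output when dither is added.
step = 2.0 / (2 ** 16)
quiet = [0.4 * step * math.sin(2 * math.pi * n / 100) for n in range(1000)]
rng = random.Random(0)
print(any(quantize(x, 16) != 0 for x in quiet))
print(any(quantize_with_dither(x, 16, rng) != 0 for x in quiet))
```

The dithered output is noisier sample by sample, but the error no longer follows the signal, so it is heard as steady background noise rather than distortion.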

Dynamic range is the ratio of highest to lowest sound pressure level. Decibels measure sound pressure ratios using a log scale: number of decibels = 20 * log10(pressure/reference pressure), which is the same as 10 * log10 of the corresponding power ratio. Usually the reference level is the threshold of human hearing, so a sound at the threshold gives a ratio of one and a level of zero decibels. Differences under about 0.5 decibels cannot be perceived by the human ear. An approximation to the dynamic range of a digital system is to multiply the number of bits by 6, since each extra bit doubles the number of amplitude levels and 20 * log10(2) is about 6. The human ear has a dynamic range of 120 dB, so 20 bits would be needed to capture sound variations accurately. Most recording is done at 24 bits or above.
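
The arithmetic behind the 6 dB per bit rule can be worked through as follows. The sketch uses the pressure (amplitude) form of the decibel, 20 * log10(ratio), which equals 10 * log10 of the corresponding power ratio; the function names are chosen for this example only.

```python
import math

def decibels(pressure, reference_pressure):
    """Level in dB of a sound pressure relative to a reference pressure."""
    return 20 * math.log10(pressure / reference_pressure)

def dynamic_range_db(bits):
    """Ratio of the largest to the smallest step of a digital system, in dB."""
    return decibels(2 ** bits, 1)

def bits_needed(range_db):
    """Smallest number of bits whose dynamic range covers range_db."""
    return math.ceil(range_db / decibels(2, 1))

print(decibels(1, 1))                  # a ratio of one is 0 dB
print(round(decibels(2, 1), 2))        # one extra bit, about 6 dB
print(round(dynamic_range_db(16), 1))  # CD resolution
print(bits_needed(120))                # range of the human ear
```

Multiplying the number of bits by 6 gives 96 dB for 16 bits, very close to the value computed from the logarithm.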